Abstract

Finding a good clustering of vertices in a network, 
where vertices in the same cluster are more tightly 
connected than those in different clusters, is a useful, 
important, and well-studied task. Many clustering
algorithms scale well; however, they are not designed
to operate on internet-scale networks with billions
of nodes or more. We study one of the fastest and
most memory-efficient algorithms possible: clustering
based on the connected components of a random
edge-induced subgraph. When defining the cost of
a clustering to be its distance from such a random 
clustering, we show that this surprisingly simple al- 
gorithm gives a solution that is within an expected 
factor of two or three of optimal with either of two 
natural distance functions. In fact, this approxima- 
tion guarantee works for any problem where there is 
a probability distribution on clusterings. We then ex- 
amine the behavior of this algorithm in the context 
of social network trust inference. 

1 Introduction 

Finding clusters or communities is one of the most 
important steps in network analysis. Clusters should 
have high internal connectivity and relatively low 
connectivity with the rest of the network. Finding
such groups of similar or tightly connected vertices
increases our understanding of the underlying
graph [23] [27] [6], and many algorithms exist
for clustering networks. Because
the networks we work with grow all of the time, some 
of these algorithms are specifically designed to per- 
form efficiently on large networks. We take this goal 
to its extreme by proposing a randomized network 
clustering algorithm which queries each edge at most 
once. We then derive approximation guarantees for 
the resulting clusterings and demonstrate its behav- 
ior on a number of real social networks. 

While our algorithm applies to networks from any
number of domains (the internet, biological networks,
etc.), our primary motivation comes from using in- 
ferred trust in social networks. With hundreds of 
millions of users on social networking websites and 
millions of pages of user-generated content coming 
online every day, there are vast networks of users,
content, and meta-data. Access to this type of infor- 
mation is extremely powerful. There is potential to 
personalize and enhance users' experiences and im- 
prove our understanding of users and their behavior. 
In particular, connecting social network data - es- 
pecially trust - to user-generated content allows sys- 
tems to direct users to the most trustworthy users 
and data. This may be through recommender sys- 
tems, search personalization, or direct presentation 
of trust information about other users. 

Clustering is an important challenge in this con- 
text. All the applications discussed above, and many 
more, can benefit from clustering over these networks. 
Motivated by these applications - particularly the 
problem of trust inference - our research addresses 
the issue of clustering the vertices in graphs. 

Using random graphs as a model, our goal is to find 
a clustering where vertices in a cluster are likely to 
be in the same connected component while vertices 
in different clusters are not. DuBois et al. [8] define 
the distance between two nodes to be the logarithm 
of the reciprocal of the probability that they are con- 
nected. Because computing this probability exactly is 
intractable (#P-complete [35]), they repeatedly sam- 
ple random graphs to estimate such probabilities to 
within any desired precision and confidence. If edges 
are chosen independently, this distance is a metric, 
and any one of a number of clustering algorithms can 
be applied. They show that this technique works well 
in some practical settings; however it has some draw- 
backs - most notably that many samples of the ran- 
dom graph are required to accurately estimate dis- 
tances between nodes, and hence the running time 
involved may be prohibitive for very large graphs. On 
the Web, where interesting graphs tend to be large,
this is a major issue.

In this paper, we present a new method for graph 
clustering where every edge is assigned an independent
probability of being present in an instance of
the graph. The connected components of the resulting
graph, which we can sample with a depth-first
search, are its clusters. Our algorithm is computationally
efficient - only a single pass is needed. Furthermore,
it applies not only to network clustering,
but to any problem where clusterings come from any 
probability distribution which we can sample. 

To analyze this algorithm, we define a distance
function between any two clusterings and attempt
to minimize the expected distance between the
algorithm's output and a randomly sampled clustering.
We show that good clusterings can be found in
expectation directly by sampling the random graph
only once. We then show that repeated sampling
improves our confidence in the result. In Section 3.1
we formalize the problem and prove that a single
random sample gives a 3-approximation in expectation.
In Section 3.3 we show how to use multiple samples to
improve on our probabilistic guarantees. Finally, in
Section 4 we apply our new algorithm to trust inference
clustering as a demonstration of its usefulness.

2 Related Work 

We begin our literature review with an overview of 
our target application - social network trust inference 
and the usage of trust-based clusters, and then move 
on to a discussion of other clustering algorithms. 

Since an individual in a social network usually 
knows only a tiny fraction of all the users, it is im- 
portant to have some mechanism for estimating the 
relative importance of unknown users. In many web- 
based applications that seek to personalize the user's 
experience, this will take the form of computing their 
influence or trustworthiness. Trust propagation is a 
particularly challenging problem because of the many 
social and interpersonal factors that play into trust. 

There are many trust inference algorithms that 
take advantage of given trust values and the structure 
of a social network, including Advogato [25], Apple- 
seed [39], Sunny [22], and Moletrust [2]. These al- 
gorithms use trust that is assigned on a continuous 
scale (e.g. 1-10). Trust can also be treated as a
probability. This approach has been used in a number of
algorithms, including [15] [29] [20]. The difficulty
of generating these probabilities, using influence as
a proxy for trust, was addressed in [14]. In our
research, we work with probabilities that are given a
priori, but those derived from other methods could
also be used in our algorithms.

The results of these algorithms have a wide range
of applications. Recommender systems are a common
application, where computed trust values are
used in place of traditional user similarity measures to
compute recommendations (e.g. [28] [1] [11]). In [15],
the authors present a technique for using trust to
estimate the truth of information that is presented,
which in turn has applications for assessing information
quality, particularly on the Semantic Web. More
specific applications of that idea include using trust
for semantic web service composition [21].

Often these algorithms require, as an intermediate 
step, finding clusters of people who are more tightly 
connected to each other than to the remainder of the 
population [32] [9]. The art of finding useful sets of
clusters has been well studied on a wide range of ap- 
plications. In some cases there is some (unknown) 
"ground-truth" clustering inherent in the data which 
we want to find, and the algorithms attempt to find a 
clustering that is "close" to the true one [313]. Often, 
though, there is no reason to believe that the data 
has inherently correct clusters, and the goal becomes 
simply to produce a clustering which works well in 
practice for a particular application. 

When each data point to be clustered consists 
of a vector of numerical values, one common tech- 
nique is to choose a distance function between the 
elements (Euclidean, L1-norm, etc.) and look for
clusters which minimize some optimization function.
Examples of these algorithms include k-means [16]
(which minimizes the mean squared distance of elements
from their cluster centers), and k-centers [17]
(which minimizes the maximum distance from any 
point to the center of a cluster). Typically approxi- 
mation algorithms, which find solutions close to op- 
timal, are used because it is impractical to compute 
the optimal clustering for these problems. For a more 
extensive overview of various clustering algorithms, 
see [37]. 

Much work has also been done specifically on clus- 
tering networks, and we give an overview of it here. 
Newman and Girvan [27] compute the shortest path
between all pairs of nodes in the network, remove the 
edge used in the most such paths, and repeat. If an 
edge is contained in many shortest paths, then intu- 
itively there are not many other short paths around 
it, and it may be a bridge between natural clus- 
ters. Some edge removals will disconnect a com- 
ponent in the graph. The order in which compo- 
nents disconnect gives a hierarchical clustering of the 
network. They can then choose the level of hierarchy
which best suits some application-specific
optimization function. Because repeatedly computing
the shortest path between all pairs may not be efficient
enough, Tang et al. (2011) [34] build on this work
by using center distance to zone [35] as a more efficient
approximation of shortest paths. Several techniques
pick a node at random and attempt to build
up a cluster around it by repeatedly adding similar
nodes. Xu et al. [38] find structural clusters (clusters
which may have low density, but have a core backbone
of nodes whose neighborhoods overlap greatly
with other core nodes). Jiang and Singh [18] propose
a similar algorithm for clustering biological networks 
where at every step a currently active cluster expands 
by adding the "closest" node if its proximity exceeds 
some threshold. 

Frequently we cluster networks in order to find 
inherent communities in the data. Leskovec et al. 
perform an extensive study on the best communities 
of different sizes in many large social networks [23].
They use conductance (or the normalized cut met- 
ric [33]), defined as the ratio of edges between the 
community and the outside world to edges within the 
community, as a measure of community strength. For 
all of the networks they examine, regardless of size, 
maximum community conductance drops off consid- 
erably for community sizes greater than one hundred. 
This result suggests that there may be no clusterings
of large social networks which help us understand
the networks' structure. However, even if clustering
such networks does not reveal anything important 
about them, it may still be useful in getting better 
application-specific results or efficiency. 

3 The Algorithm 

Recently DuBois et al. [8] proposed an interpretation
of trust within a social network based upon taking a 
random edge-induced subgraph. In their framework, 
the direct trust on an edge corresponds to the prob- 
ability that the edge will be in a random instance of 
the graph and indirect trust between any two people 
in the network corresponds to the probability that 
there is a path between them. The ability to cluster 
the network into groups of relatively high trust ranks 
among their main contributions. They find these 
clusters by repeatedly sampling the random graph 
to estimate the path probability between all pairs of 
nodes, and then apply various well-studied clustering 
algorithms to the resulting distances. In order to have 
confidence that all of these pairwise distances are accurate
to within a tolerance of $\pm\delta$, $O(\log n / \delta^2)$ samples
are needed on a network with n people. This poses 
a major drawback on internet-scale datasets, which 
can have millions or billions of users. 



Their solution takes a trust network, computes a 
distance function between pairs of points, and then 
uses those distances to find a clustering. This solu- 
tion scales fairly well, but we would like to do better. 
There is nothing inherent in clustering which requires 
computing distances as an intermediate step, which 
inspired us to skip the distance computation alto- 
gether. Our algorithm takes a single sample of a ran- 
dom graph and uses its connected component decom- 
position as the clustering. Sampling the graph and 
computing the connected components can be done 
simultaneously in a single pass over the edges of the 
graph using depth-first search, and is thus as fast
an algorithm as we can reasonably hope for.
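
The single-pass procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_clustering`, the adjacency-dictionary input, and the `prob` mapping keyed by `frozenset({u, v})` are assumptions made for the example. It assumes a simple undirected graph and uses a stack-based depth-first traversal that flips the coin for an edge only when that edge could still pull an unvisited vertex into the current component, so no edge is sampled more than once.

```python
import random

def sample_clustering(adj, prob, rng=None):
    """Sample the connected components of a random edge-induced subgraph.

    adj  : dict mapping each vertex to an iterable of its neighbours
           (hypothetical input format; simple undirected graph assumed).
    prob : dict mapping frozenset({u, v}) to that edge's probability.
    """
    rng = rng or random.Random()
    seen = set()
    clusters = []
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        component = {root}
        stack = [root]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                # If v is already placed, the outcome of edge (u, v)
                # cannot change the components, so it is never sampled.
                if v in seen:
                    continue
                if rng.random() < prob[frozenset((u, v))]:
                    seen.add(v)
                    component.add(v)
                    stack.append(v)
        clusters.append(component)
    return clusters
```

With all edge probabilities equal to 1 this returns the ordinary connected components, and with all probabilities equal to 0 it returns singletons.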

Of course a fast algorithm that produces poor results
would not be useful. In Section 3.1 we derive
probabilistic bounds on the quality of the resulting 
clusterings. We start by defining a distance function 
between clusterings with the goal of minimizing the 
distance between our chosen clustering and a con- 
nected component decomposition of an instance of 
the random graph. We do not expect to be able to 
find the best such clustering (and in the general case 
where sample clusters come not from a random graph 
but from a black box it is not possible), however our 
single sample algorithm achieves a 3-approximation 
in expectation. This means that a random clustering 
will, on average, produce distances no more than 3 
times those produced by the best clustering. We also 
show that using a slightly different distance function 
(one used by Balcan et al. [4]) our algorithm achieves 
a 2-approximation in expectation. Finally we show 
how to achieve improved bounds on the deviation 
from this expectation by sampling multiple times. 

3.1 Definitions and Formal Analysis 

Consider the following problem: 

• A clustering of a set U is a set C of subsets of U
such that $\forall x \in U$, $|\{s \in C : x \in s\}| = 1$. In other
words, a clustering is a set of disjoint subsets
whose union is the entire set U. For convenience
we let a clustering contain an arbitrary number
of distinct empty sets.

• Given a set U, and a probability distribution V
on clusterings of U, we want to find a clustering
X which minimizes the expected distance
between X and a random clustering drawn from
V.

• There are many possible distance functions between
clusterings; we will concentrate on the following
one: for two clusterings X and Y, define
$D_X(Y)$ to be $\min_{f: Y \to X} \sum_{s \in Y} \left( |s \cup f(s)| - |s \cap f(s)| \right)$.
In other words, for each set in Y,
we match it to the set in X which minimizes the
size of the symmetric difference between the two.
We will later consider a similar distance function
proposed by Balcan et al. [4]. They let the
distance between two clusterings be the minimum
number of elements not in the matched clusters
under any matching of the clusters. For any two
clusterings, the distance using our metric is at least
the distance using theirs, and at most twice the
distance using theirs.
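
The distance $D_X(Y)$ defined above can be computed directly. The sketch below is illustrative only; the name `clustering_distance` and the list-of-sets representation are assumptions. It relies on the observation that, since $f$ is unrestricted, the minimum can be taken independently for each set in Y, with an empty set always available as a match.

```python
def clustering_distance(X, Y):
    """D_X(Y): match each set in Y to its cheapest set in X (or an empty set)."""
    candidates = [frozenset(c) for c in X] + [frozenset()]
    total = 0
    for s in Y:
        s = frozenset(s)
        # |s ∪ t| - |s ∩ t| is the size of the symmetric difference of s and t.
        total += min(len(s | t) - len(s & t) for t in candidates)
    return total

X = [{1}, {2}]
Y = [{1, 2}]
print(clustering_distance(X, Y))  # 1
print(clustering_distance(Y, X))  # 2
```

The two printed values differ, which shows that this distance is not symmetric in its arguments.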

Note that while we do not restrict the function $f$,
the optimal choices for $f$ are bijections (when we
include an appropriate number of empty sets in the two
clusterings). This follows from the observation that
for any optimal $f$, and for all $s \in Y$, $s \cap f(s)$ is at
least half the size of both $s$ and $f(s)$. Otherwise it
would be better to map $s$ to the empty set. Since no
two distinct $s_1, s_2 \in X$ can both share more than half
of the elements of a single $y \in Y$, $f$ must be one-to-one.
By mapping extra, distinct empty sets onto the
remaining sets in X (if any), $f$ becomes a bijection.

For any given probability distribution on clusterings
V (such as the one given by the connected component
decomposition of a random graph), define
random variable Y to be a clustering drawn from
that distribution. Let C be a clustering that minimizes
the expectation of $D_C(Y)$. Since the distribution
V is a black box in general, we cannot hope to
find the actual clustering C even if we assume
P = NP (because we cannot distinguish between
two distributions with complete certainty). However,
the following simple algorithm surprisingly gives a
3-approximation in expectation:

Take a random sample $C'$ from V, and use that as
the approximation.

The analysis proceeds as follows: 

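Let U, V, and C be given. Define $g_X(u)$ to be the
set s in the clustering X that contains u and $f_X^Y$ to be
the best function mapping clusters in Y to clusters
in X. The expected cost of the optimal solution is

$$
\begin{aligned}
E[D_C(Y)] &= E[D_C(C')] \\
&= \sum_{Y} \Pr[Y] \sum_{s \in Y} \left( |s \cup f_C^Y(s)| - |s \cap f_C^Y(s)| \right) \\
&= \sum_{u \in U} \sum_{Y} \Pr[Y] \cdot
\begin{cases}
0 & \text{if } f_C^Y(g_Y(u)) = g_C(u) \\
1 & \text{if } f_C^Y(\{\}) = g_C(u) \\
2 & \text{otherwise}
\end{cases} \\
&= \sum_{u \in U} \Bigg( \sum_{Y : f_C^Y(g_Y(u)) \ne g_C(u)} \Pr[Y]
 + \sum_{Y : f_C^Y(g_Y(u)) \ne g_C(u) \,\wedge\, g_C(u) \ne f_C^Y(\{\})} \Pr[Y] \Bigg) \qquad (1)
\end{aligned}
$$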



Each element u adds to the total cost only if its set
in Y does not map to its set in C. In that case it costs
1 because of the mapping from $g_Y(u)$ to $f_C^Y(g_Y(u))$,
and it costs another 1 if some non-empty set in Y
maps to $g_C(u)$.

The expected cost, $E[D_{C'}(Y)]$, of our approximated
solution is derived in Figure 1, where Equation 2
maps $s \in Y$ to $s' \in C'$ if and only if they
both map to the same subset in C. This mapping
must cost at least as much as the optimal mapping
$f_{C'}^{Y}$. Dividing Equation 3 by Equation 1 gives
$E[D_{C'}(Y)] / E[D_C(Y)] \le 3$.

We demonstrate that this upper bound is tight 
with the following distribution on clusterings: 

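$$\Pr[Y = \{\{1\},\{2\}\}] = 1 - \Pr[Y = \{\{1,2\}\}] = 1 - \frac{1}{k}$$

The optimal solution simply matches the high-probability
case, $C = \{\{1\},\{2\}\}$. The expected cost of this
solution is $((k-1) \cdot 0 + 1 \cdot 1)/k$. The expected cost
of using a random sample is

$$\frac{(k-1) \cdot \left(\frac{k-1}{k} \cdot 0 + \frac{1}{k} \cdot 1\right) + 1 \cdot \left(\frac{k-1}{k} \cdot 2 + \frac{1}{k} \cdot 0\right)}{k}$$

which reduces to $3 \cdot \frac{k-1}{k^2}$, and thus approaches 3 times
optimal as $k \to \infty$.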
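
The arithmetic of this tightness example can be checked with exact rational numbers. The sketch below is only a verification aid under stated assumptions: `clustering_distance` is an illustrative implementation of the distance $D_X(Y)$ from Section 3.1 (minimizing independently for each set in Y, with an empty set available as a match), and `expected_costs` is a hypothetical helper that enumerates the two-outcome distribution.

```python
from fractions import Fraction

def clustering_distance(X, Y):
    """D_X(Y): match each set in Y to its cheapest set in X (or an empty set)."""
    candidates = [frozenset(c) for c in X] + [frozenset()]
    return sum(
        min(len(frozenset(s) | t) - len(frozenset(s) & t) for t in candidates)
        for s in Y
    )

def expected_costs(k):
    """Return (cost of C = {{1},{2}}, cost of a random sample), both in expectation."""
    split, merged = [{1}, {2}], [{1, 2}]
    dist = [(split, Fraction(k - 1, k)), (merged, Fraction(1, k))]
    optimal = sum(p * clustering_distance(split, Y) for Y, p in dist)
    sampled = sum(
        pc * py * clustering_distance(C, Y)
        for C, pc in dist
        for Y, py in dist
    )
    return optimal, sampled

opt, smp = expected_costs(10)
print(opt, smp, smp / opt)  # 1/10 27/100 27/10
```

For k = 10 the ratio is 27/10, matching $3(k-1)/k$, which tends to 3 as k grows.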

3.2 Expected 2-Approximation

We briefly consider the case where we change the
distance function between clusterings to

$$D_C(Y) = \min_f |\{u : f(g_Y(u)) \ne g_C(u)\}|$$

(as used by [1]) and keep all definitions and notations
the same as above. This distance function costs
exactly 1 for each element u whose set in Y is not
mapped to its set in C. Using this metric, our algorithm
yields a 2-approximation in expectation. We
show this by rewriting the distance function as

$$E[D_C(Y)] = \sum_{u \in U} \; \sum_{Y : f_C^Y(g_Y(u)) \ne g_C(u)} \Pr[Y]$$

Balcan et al. [3] observe that this function is symmetric
and obeys the triangle inequality. Therefore
the expected distance $E[D_{C'}(Y)] \le E[D_C(C') + D_C(Y)] = 2E[D_C(Y)]$.
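
For small inputs, this matching-based distance can be evaluated by brute force. The sketch below is an illustration under assumptions, not the method of the cited work: it treats $f$ as a one-to-one matching between clusters (padding the shorter clustering with empty sets), assumes both clusterings partition the same set U, and tries every permutation, so it is only practical for a handful of clusters. The name `matching_distance` is hypothetical.

```python
from itertools import permutations

def matching_distance(X, Y):
    """Number of elements whose cluster in Y is not matched to their cluster in X,
    minimized over all one-to-one matchings (brute force)."""
    X = [frozenset(c) for c in X]
    Y = [frozenset(c) for c in Y]
    size = max(len(X), len(Y))
    X += [frozenset()] * (size - len(X))
    Y += [frozenset()] * (size - len(Y))
    n = sum(len(s) for s in Y)
    # An element is "free" exactly when it lies in both matched clusters.
    best_overlap = max(
        sum(len(Y[i] & X[j]) for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )
    return n - best_overlap

A = [{1, 2, 3}]
B = [{1, 2}, {3}]
C = [{1}, {2}, {3}]
print(matching_distance(A, B), matching_distance(B, C), matching_distance(A, C))  # 1 1 2
```

On this example the values are symmetric in their arguments and satisfy the triangle inequality with equality.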

3.3 Multiple Samples 

Depending on the application, the guarantee of a
3-approximation in expectation may not be sufficient.
An unlikely sample could have arbitrarily bad behavior.
For example, the probability that a sample is
better than a 5-approximation is not guaranteed to
be any higher than 1/2. In this subsection we explore
various ways to use multiple samples to achieve
better approximation guarantees.

Figure 1: Derivations showing that our algorithm gives a 3-approximation in expectation.

Since our approximation guarantee is in expectation,
it is important to limit the probability that we
choose a bad clustering $C'$ (one where $E[D_{C'}(Y)]$ is
much greater than $3E[D_C(Y)]$). We do this using
Markov's inequality. Since the approximation ratio
is always at least 1 (no solution can be better than
the optimal solution),

$$\Pr\left[E[D_{C'}(Y)] > (3 + 2\epsilon)E[D_C(Y)]\right] \le 1/(1 + \epsilon).$$
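
The Markov step can be spelled out as follows. This is a sketch of the reasoning under two assumptions: that $E[D_C(Y)] > 0$, and that we write $R_{C'} = E[D_{C'}(Y)] / E[D_C(Y)]$ for the approximation ratio of a sample $C'$ (the symbol $R_{C'}$ is introduced here only for illustration). Since $R_{C'} \ge 1$ and its expectation over $C'$ is at most 3, the quantity $R_{C'} - 1$ is nonnegative with expectation at most 2.

```latex
\begin{aligned}
\Pr[R_{C'} > 3 + 2\epsilon]
  &= \Pr[R_{C'} - 1 > 2 + 2\epsilon] \\
  &\le \frac{E[R_{C'} - 1]}{2 + 2\epsilon} \\
  &\le \frac{2}{2 + 2\epsilon} = \frac{1}{1 + \epsilon}.
\end{aligned}
```

Setting $\epsilon = 1$ gives the bound of 1/2 for a 5-approximation.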

We could attempt to bound the variance in the
approximation ratio; however, as the above example
illustrates (or any example where most of the
probability mass lies close to the optimal solution, with
a small amount of mass at a large distance), it can
be quite bad. Through repeated sampling, we can
do much better. Instead of taking only a single sample
clustering, let us take samples $C'_1, \dots, C'_m$ from
the distribution. The first quantity of interest is the
approximation ratio R achieved by the best of these
samples - $\min_i D_{C'_i}(X)$. Since the sampled $C'_i$'s are
independent,

$$
\begin{aligned}
R &= \Pr\left[\min_i D_{C'_i}(X)/E[D_C(X)] > 3 + 2\epsilon\right] \\
&\le \prod_i \Pr\left[D_{C'_i}(X)/E[D_C(X)] > 3 + 2\epsilon\right] \\
&\le 1/(1 + \epsilon)^m.
\end{aligned}
$$

Thus if we want at most a $\tau$ probability of having
no samples within this distance, we need
$m = \lceil \log_{1+\epsilon} 1/\tau \rceil$ or more samples.
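
As a small worked example of this sample count, the sketch below evaluates $m = \lceil \log_{1+\epsilon} 1/\tau \rceil$ numerically. The function name `samples_needed` is an assumption made for illustration, and floating-point logarithms are used, so inputs that land exactly on an integer boundary may round either way.

```python
import math

def samples_needed(eps, tau):
    """Smallest m with (1 / (1 + eps)) ** m <= tau, via m = ceil(log_{1+eps}(1/tau))."""
    return math.ceil(math.log(1 / tau) / math.log(1 + eps))

print(samples_needed(1.0, 0.01))  # 7
print(samples_needed(0.5, 0.05))  # 8
```

With $\epsilon = 1$ (a 5-approximation threshold), seven samples suffice to push the failure probability below 1%.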

The existence of a sample $C'_i$ which is close to a
3-approximation does not directly imply that we can
determine which of the samples are good. If our
application allows us to test each of the samples and
choose one with the best results, we may not need
to find the one with the best approximation ratio.
Otherwise, to be certain which of the $C'_i$ gives the
best approximation ratio, we would need to know C
already (or at least be able to calculate $D_C(X)$ knowing
X but not C). We can get around this by taking $l$
additional samples $\{X_1, \dots, X_l\}$, and computing for
each of the $C'_i$ the total distance from the $X_j$'s. We
then select the $C'_i$ with the minimum such total distance.
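
The selection step can be sketched as below. This is an illustration under assumptions, not the authors' code: `clustering_distance` is a simple implementation of $D_X(Y)$ from Section 3.1, `select_candidate` is a hypothetical helper name, and the candidate and reference clusterings are supplied as plain lists of sets rather than drawn from a sampler.

```python
def clustering_distance(X, Y):
    """D_X(Y): match each set in Y to its cheapest set in X (or an empty set)."""
    candidates = [frozenset(c) for c in X] + [frozenset()]
    return sum(
        min(len(frozenset(s) | t) - len(frozenset(s) & t) for t in candidates)
        for s in Y
    )

def select_candidate(candidates, references):
    """Return the candidate with the smallest total distance to the references,
    together with the list of all totals."""
    totals = [
        sum(clustering_distance(C, X) for X in references)
        for C in candidates
    ]
    best = min(range(len(candidates)), key=totals.__getitem__)
    return candidates[best], totals

candidates = [[{1, 2, 3, 4}], [{1, 2}, {3, 4}]]
references = [[{1, 2}, {3, 4}], [{1, 2}, {3}, {4}], [{1, 2, 3}, {4}]]
chosen, totals = select_candidate(candidates, references)
print(chosen, totals)  # [{1, 2}, {3, 4}] [10, 4]
```

Here the second candidate is selected because its total distance to the three reference clusterings is 4, against 10 for the first.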

We must now consider that the samples $X_1, \dots, X_l$
that we draw might give a total distance larger than
its expectation for the "good" $C'_i$ and smaller than
the expectation for a "bad" $C'_i$. To address this,
we now show that when $l$ is sufficiently large, with
high probability even if we don't select the best $C'_i$,
we will select one which is close enough. For each $C'_i$,
the expected cost $E[D_{C'_i}(X)] \ge E[D_C(X)]$, and
with probability at least $1 - \tau$, for at least one such
$C'_i$, $E[D_{C'_i}(X)] \le (3 + 2\epsilon) \cdot E[D_C(X)]$. We define

$d_i = \sum_{j=1}^{l} D_{C'_i}(X_j)/|U|$. Since the $d_i$ are a sum
of independent random variables taking on values in
[0, 1], we can apply Chernoff bounds to their deviation
probability. If we take $l = (|U|/\delta^2) \cdot O(\log(k/p))$,
then with probability at least $1 - p$ all candidates'
distance totals $d_i$ will be estimated to within $(1 \pm \delta)$.
If there exists a candidate with distance total

$$D_{C'_i}(Y) \le E[D_C(Y)] \cdot (3 + 2\epsilon) \cdot l,$$

and all such estimates are within $(1 \pm \delta)$ of the true
totals, then the candidate with the minimum estimated
total has a true total at most
$(3 + 2\epsilon) E[D_C(Y)] \cdot l \cdot (1 + 2\delta)$.
This candidate gives an approximation ratio
of $(3 + 2\epsilon) \cdot (1 + 2\delta)$. Such a candidate is found with
probability at least $1 - p - \tau$.

4 Trust Inference Application 

Having a 3-approximation algorithm is a nice theo- 
retical result, but it does not necessarily imply prac- 
tical benefits. For example, if an optimal solution has 
a large expected distance (1/3 of the maximum distance,
for example), then a 3-approximation is meaningless.
The hope is that networks will only have such
bad behavior if they are inherently not well cluster- 
able. There is some intuitive reason to believe this 
is so. If a certain set of nodes often forms the ma- 
jority of a connected component and they are in the 
same component c of an optimal clustering, then a 
random clustering Y will likely have a component y 
that matches with low cost to c. Meanwhile if the op- 
timal clustering has high cost, that means that few 
large groups of nodes consistently form the bulk of a 
component. 

In this section we explore what kind of clusterings 
occur in real trust networks using various parameters.
We examine the Trust Project, FilmTrust [12],
and Epinions networks. Visualizations of these networks
are shown in Figure 2 with their sizes shown
in Table 1. In the first two of these networks, users
rate their level of trust (with respect to a specified do- 
main) in their friends. In the Epinions network, users 
rate whether they like or dislike statements made by 
others, and these ratings can be used as a proxy for 
ratings of the statements' authors. In this paper we 
address only positive trust, so unfavorable ratings are 
discarded. 

The Trust Project network is derived from an early 
Semantic Web trust network [13]. As is shown in Figure 2,
it has many star patterns. This occurs when
users make connections to many friends who do not, 
in turn, participate in the network. Thus, they have 
no outgoing connections. This affects our ability to 
cluster the network. The FilmTrust network is built 
from a social network in which users rate movies and 
how much they trust their friends in that context. As 
the visualization shows, it has a more traditional network
structure. However, there are a number of small
groups that are disconnected from the giant component.
These are shown as the small subnetworks, often
pairs, floating around the edge of the visualization.
Finally, the Epinions network shows social network
connections on the product review site. Trust
ratings indicate how much they trust one another's
reviews.

Figure 2: The three networks used in our analysis
have very different structures. The Trust Project
Network (top) has many star formations which affect
the quality of its clusters. The FilmTrust Network
(middle) is a more traditionally organized social
network. The Epinions Network (bottom) is much
larger.

Network        Nodes    Edges    Density
Trust Project  62       105      0.055
FilmTrust      310      774      0.016
Epinions       114,467  717,667  0.0001

Table 1: The size and descriptive statistics of our
three example networks. Density is calculated as the
ratio of edges to possible edges.

In all of our networks, ratings form directed trust 
edges. Our first step is to create an undirected and 
normalized trust graph. We convert every lone di- 
rected edge into an undirected edge, and whenever 
two people rate each other, we average their ratings to 
form a single undirected edge. The edge weights are 
then normalized to fall between 1 and 10. Since we 
need probabilities on the edges instead of weights, we 
introduce a global parameter t. An edge with weight 
w gets probability min(1, w/t). Therefore when t is
small, edge probabilities tend to be high and con- 
nected components will be large, and for large t edge 
probabilities and connected components will gener- 
ally be smaller. 
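
The preprocessing described above can be sketched as follows. This is a minimal illustration under assumptions: ratings are given as a dictionary keyed by directed pairs `(u, v)`, weights are taken to be already normalized to the range 1 to 10 (the normalization itself is not shown), and the helper name `edge_probabilities` is hypothetical. Reciprocal ratings are averaged, lone directed ratings become undirected edges, and each weight w is mapped to min(1, w/t).

```python
def edge_probabilities(ratings, t):
    """Map directed ratings {(u, v): weight} to undirected edge probabilities.

    Weights are assumed to lie in [1, 10] already; t is the global parameter.
    """
    grouped = {}
    for (u, v), w in ratings.items():
        if u == v:
            continue  # ignore self-ratings in this sketch
        grouped.setdefault(frozenset((u, v)), []).append(w)
    return {
        edge: min(1.0, (sum(ws) / len(ws)) / t)
        for edge, ws in grouped.items()
    }

ratings = {("a", "b"): 10, ("b", "a"): 6, ("b", "c"): 4}
print(edge_probabilities(ratings, 20))
```

With t = 20 the mutual ratings of 10 and 6 average to 8 and give probability 0.4, while the lone rating of 4 gives 0.2; with t = 2 both edges get probability 1.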

For the Trust Project and FilmTrust networks we 
vary t from 2 to 30, generating 30 sample clusterings 
for each value. For the Epinions network, we need 
considerably higher values of t to capture the same 
behavior, so we use a range of 14 to 50. In Figure 3
we show the frequencies of each component size and 
each component's benefit, where we define the benefit 
of a cluster to be its size minus its cost (or how much 
less it costs than its maximum possible cost). Due to 
our choice of cost function, two clusters each have to 
share at least half of their nodes to have any benefit 
at all. The x-axis shows the value of t, the y-axis 
shows the component size (or benefit), and the circle 
diameters show how many components are that
size (or benefit) in our samples. This gives a sense of 
what size clusters to expect for different values of t. 
Values of t < 10 are included for informational pur- 
poses, but may be poor choices in practice, because 
they give equal weight to all user trust ratings > t. 
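
The benefit of a single cluster, as defined above, can be computed as in the sketch below. The helper name `cluster_benefit` and the list-of-sets representation are assumptions for illustration; the cost of a cluster is taken to be the smallest symmetric difference between it and any cluster of the other clustering, with an empty set always available as a match.

```python
def cluster_benefit(s, X):
    """Size of cluster s minus its matching cost against clustering X."""
    s = frozenset(s)
    options = [frozenset(c) for c in X] + [frozenset()]
    cost = min(len(s | t) - len(s & t) for t in options)
    return len(s) - cost

print(cluster_benefit({1, 2, 3, 4}, [{1, 2, 3}, {4, 5}]))  # 3
print(cluster_benefit({1, 2}, [{1, 3, 4}, {2, 5, 6}]))     # 0
```

In the second call no cluster of X shares enough of its nodes with {1, 2}, so matching to an empty set is cheapest and the benefit is zero.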

Figure 4 contains scatter plots of the distances between
pairs of randomly sampled clusterings for all 3
datasets. As discussed in Section 3.1, the expected
distance between two randomly chosen clusterings
is at most 3 times optimal. So these plots demonstrate
roughly how similar clusterings are, and what
range $E[D_C(X)]$ falls into. When $t \to 0$, the random
graphs lose their randomness and are always
connected. Conversely at $t \to \infty$, the graphs are
always disconnected. Therefore at the two extremes
distances will be 0. Of interest here is the shape of 
the curve in between, and specifically for what values 
of t are there good, representative clusterings. 

From Figures 3 and 4 it is evident that the Trust
Project network does not produce particularly sta- 
ble clusters. Most of the benefits, even for relatively 
small values of t, are quite small when compared to 
the larger cluster sizes. This means that there is not 
much more than 1/2 overlap between matched clusters.
Instead most of its benefit comes from small
clusters matching up well. This may be (in part) a 
product of star shaped connections. Under the right 
conditions a star graph should form a single cluster. 
But with our algorithm it will form a random large 
cluster and many singletons, which will have high dis- 
tance from another such random clustering. 

The FilmTrust network creates considerably more 
consistent clusterings. Much more of the benefit 
comes from a large, fairly consistent cluster, but con- 
siderable benefit comes from smaller clusters as well. 
Even with t as high as 20 (which corresponds to max- 
imum trust giving only a 1/2 edge probability), there 
are still large clusters that share a considerable core 
component, indicating a very stable cluster. For this 
network as well as Trust Project, the shape of the 
cost curves largely depends on the giant component. 
If it exists and is stable for a given t, the costs are 
low. If it exists but changes significantly with differ- 
ent samples, then some pairs have low cost, and some 
have high cost. 

In the Epinions dataset, our algorithm consistently
identifies a stable giant component and a number of 
smaller components of widely varying stability. This 
clustering behavior is particularly useful for applica- 
tions that use the trust values to boost performance. 
Trust values can be used in recommender systems 
to generate predictive ratings based on a user's social
connections [11]. Integrating information about
trust clusters into traditional recommender systems
can significantly improve the accuracy of recommendations
[5]. The smaller groups of size <30 that are
identified may reflect the types of niche interest groups
that benefit most from the trust clustering recommendation
techniques.

Figure 3: The top two plots show component sizes (blue) and benefit sizes (green) within Trust Project (left)
and FilmTrust (right) for t = 2 to 30 with 30 samples of each. The bottom two plots show the component
sizes and benefits for the Epinions dataset, with the left plot showing small clusters and the right plot the
largest clusters. A circle centered at (x, y) with radius r indicates that the number of clusters of size (or
benefit) y with t = x is proportional to r.

5 Conclusion

In this paper, we introduce a simple and extremely efficient
network clustering algorithm, mathematically
derive bounds on its expected approximation ratio 
and probability of substantial deviation from this ra- 
tio, and demonstrate its application in social trust 
networks. We treat trust as a probability on edges in 
the network and present an algorithm that only re- 
quires the ability to sample clusters from a black-box 
probability distribution rather than explicitly com- 
puting distance in the network. We then prove that 
this is a 3-approximation algorithm with good theo- 
retical performance. To show the practical applica- 
tions, we test the algorithm on three real-world social 
trust networks. 

This work has applications for mining social net- 
works, recommender systems, content filtering, and 
more. Future work will explore on what types of net- 
works this algorithm is most effective as well as which 
applications can benefit the most from it.

Figure 4: This figure shows costs between randomly
sampled clusterings for Trust Project (top),
FilmTrust (middle), and Epinions (bottom) net- 
works. The maximum distance between samples in 
the smaller two networks is approximately the size of 
the network, whereas the maximum distance in the 
Epinions network is roughly half of its network size. 